Applications in Natural Language Processing
FIGURE 5.4
The overview of the algorithm proposed in [118].
dense matrix of multi-head self-attention is treated as a group. As a result, there are 12
groups, since there are 12 heads. Then, within each group, they bucket sequential output
neurons together as sub-groups, e.g., every N output neurons form one sub-group.
Consequently, there are 12 × 64/N sub-groups in total (the hidden dimension in each head of
BERT-base is 768/12 = 64). Each sub-group then has its own quantization range. Fig. 5.6
presents an illustration. Here Nh
FIGURE 5.5
Top eigenvalue distributions of different encoder layers for various datasets, including SST-
2, MNLI, CoNLL-03, and SQuAD. The middle layers generally have higher mean values
and larger variance than the others. The last three layers have the smallest variance and
mean values among all layers.
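The per-layer top eigenvalues summarized in Fig. 5.5 can be estimated without ever forming the full Hessian, using only Hessian-vector products. The sketch below is not the authors' implementation; it is a minimal power-iteration routine, illustrating how the dominant eigenvalue of a symmetric operator (such as a layer's loss Hessian) can be computed from matrix-vector products alone. The function name and iteration count are illustrative choices.

```python
import numpy as np

def top_eigenvalue(hessian_vector_product, dim, num_iters=100, seed=0):
    """Estimate the dominant eigenvalue of a symmetric operator via power
    iteration. `hessian_vector_product(v)` must return H @ v for a vector v,
    so the full (dim x dim) Hessian never needs to be materialized."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)          # start from a random unit vector
    eig = 0.0
    for _ in range(num_iters):
        hv = hessian_vector_product(v)
        eig = float(v @ hv)         # Rayleigh quotient estimate
        v = hv / np.linalg.norm(hv) # renormalize for the next step
    return eig
```

In a deep-learning framework, `hessian_vector_product` would be implemented with a double backward pass over the layer's parameters; power iteration converges to the eigenvalue of largest magnitude.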
FIGURE 5.6
The overview of group-wise quantization method proposed in [209]. Here Nh (number of
heads) value matrices Wv are concatenated together, resulting in a 3-d tensor. The same
color denotes the same group with a shared quantization range.
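The grouping scheme described above can be sketched in code. The snippet below is a simplified illustration, not the implementation from [209]: it applies symmetric uniform min-max quantization to a 768 × 768 BERT-base weight matrix, treating each of the 12 heads as a group and every `sub_group_size` (the N above) consecutive output neurons within a head as a sub-group with its own quantization range. The function name and the choice of N = 16 are illustrative.

```python
import numpy as np

def groupwise_quantize(W, num_heads=12, sub_group_size=16, num_bits=8):
    """Group-wise quantization sketch: each head's rows form a group; within
    a head, every `sub_group_size` output neurons share one symmetric
    min-max quantization range. Returns the dequantized (simulated) weights."""
    d_model = W.shape[0]                     # 768 for BERT-base
    head_dim = d_model // num_heads          # 768 / 12 = 64
    q_max = 2 ** (num_bits - 1) - 1          # 127 for 8-bit symmetric
    W_q = np.empty_like(W)
    for h in range(num_heads):
        start = h * head_dim
        for s in range(start, start + head_dim, sub_group_size):
            sub = W[s:s + sub_group_size]    # one sub-group of output neurons
            scale = np.abs(sub).max() / q_max  # sub-group's own range
            scale = scale if scale > 0 else 1.0
            q = np.round(sub / scale)        # integer levels in [-q_max, q_max]
            W_q[s:s + sub_group_size] = q * scale
    return W_q
```

With N = 16 this produces 12 × 64/16 = 48 sub-groups, each carrying its own scale; finer sub-groups reduce quantization error at the cost of storing more ranges.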